Load and preprocess data

First, load the required packages for this walkthrough.

pkg_list <- c("tidyverse", "psych", "ez", "rcompanion", "knitr", "car",
              "afex", "ggfortify", "Hmisc", "emmeans", "jtools", "apaTables")
invisible(lapply(pkg_list, library, quietly = TRUE, character.only = TRUE))
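If some of these packages are not yet installed on your machine, a small convenience sketch (base R only, using the `pkg_list` vector defined above) can install whatever is missing before loading:

```r
# Install any packages from pkg_list that are not already installed
# (pkg_list is the character vector defined above)
missing_pkgs <- setdiff(pkg_list, rownames(installed.packages()))
if (length(missing_pkgs) > 0) install.packages(missing_pkgs)
```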

Optional package

# devtools::install_github("ropensci/skimr")
library("skimr")

The data set used in this walkthrough, sat.act, is loaded automatically when psych is attached. You can see what each column represents on the data set’s help page with ?sat.act.

Next, just to get a snapshot of the data, we can use head() and str(). Notice that gender and education are integer columns despite being categorical variables.

head(sat.act)
##       gender education age ACT SATV SATQ
## 29442      2         3  19  24  500  500
## 29457      2         3  23  35  600  500
## 29498      2         3  20  21  480  470
## 29503      1         4  27  26  550  520
## 29504      1         2  33  31  600  550
## 29518      1         5  26  28  640  640
str(sat.act)
## 'data.frame':    700 obs. of  6 variables:
##  $ gender   : int  2 2 2 1 1 1 2 1 2 2 ...
##  $ education: int  3 3 3 4 2 5 5 3 4 5 ...
##  $ age      : int  19 23 20 27 33 26 30 19 23 40 ...
##  $ ACT      : int  24 35 21 26 31 28 36 22 22 35 ...
##  $ SATV     : int  500 600 480 550 600 640 610 520 400 730 ...
##  $ SATQ     : int  500 500 470 520 550 640 500 560 600 800 ...

Preprocessing integers into factors

Since some of the analyses we will run require categorical variables, we first need to preprocess some of the integer columns into categories. Let’s quickly add two columns to the data set holding labelled versions of gender and education.

sat_act <- sat.act %>%
  mutate(.,
    gender_fac = case_when(gender == 1 ~ "male",
                           gender == 2 ~ "female"),
    education_fac = case_when(education == 0 ~ "none",
                              education == 1 ~ "some_hs",
                              education == 2 ~ "high_school",
                              education == 3 ~ "some_college",
                              education == 4 ~ "college",
                              education == 5 ~ "graduate")
  )
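An equivalent base-R sketch uses factor() with levels and labels, which also lets you fix the display order of education levels instead of getting alphabetical order from character columns (the mapping below mirrors the case_when() recoding above):

```r
library(psych)  # provides the sat.act data set

# Map the integer codes to labelled factors in one step per column
sat_act2 <- sat.act
sat_act2$gender_fac <- factor(sat_act2$gender, levels = 1:2,
                              labels = c("male", "female"))
sat_act2$education_fac <- factor(sat_act2$education, levels = 0:5,
                                 labels = c("none", "some_hs", "high_school",
                                            "some_college", "college", "graduate"))
```

Unlike the character columns produced by case_when(), these factors keep the substantive ordering of education levels.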

Describe data

Before analysis, we can obtain descriptive statistics quickly with the describe() function. One caveat: both psych and Hmisc supply a describe(), and because Hmisc was attached later here, the call below dispatches to Hmisc::describe() (note the Info and Gmd columns in the output). To force psych’s version, call psych::describe(sat_act) explicitly.

describe(sat_act)
## sat_act 
## 
##  8  Variables      700  Observations
## ------------------------------------------------------------------------------------------------------------------------------------------------------
## gender 
##        n  missing distinct     Info     Mean      Gmd 
##      700        0        2    0.685    1.647   0.4574 
##                       
## Value          1     2
## Frequency    247   453
## Proportion 0.353 0.647
## ------------------------------------------------------------------------------------------------------------------------------------------------------
## education 
##        n  missing distinct     Info     Mean      Gmd 
##      700        0        6    0.922    3.164    1.532 
##                                               
## Value          0     1     2     3     4     5
## Frequency     57    45    44   275   138   141
## Proportion 0.081 0.064 0.063 0.393 0.197 0.201
## ------------------------------------------------------------------------------------------------------------------------------------------------------
## age 
##        n  missing distinct     Info     Mean      Gmd      .05      .10      .25      .50      .75      .90      .95 
##      700        0       48    0.994    25.59     9.56       17       18       19       22       29       39       46 
## 
## lowest : 13 14 15 16 17, highest: 57 58 61 62 65
## ------------------------------------------------------------------------------------------------------------------------------------------------------
## ACT 
##        n  missing distinct     Info     Mean      Gmd      .05      .10      .25      .50      .75      .90      .95 
##      700        0       23    0.995    28.55    5.407       20       22       25       29       32       35       35 
## 
## lowest :  3 15 16 17 18, highest: 32 33 34 35 36
## ------------------------------------------------------------------------------------------------------------------------------------------------------
## SATV 
##        n  missing distinct     Info     Mean      Gmd      .05      .10      .25      .50      .75      .90      .95 
##      700        0       70    0.997    612.2      126      400      450      550      620      700      750      780 
## 
## lowest : 200 240 245 300 330, highest: 785 790 797 799 800
## ------------------------------------------------------------------------------------------------------------------------------------------------------
## SATQ 
##        n  missing distinct     Info     Mean      Gmd      .05      .10      .25      .50      .75      .90      .95 
##      687       13       72    0.998    610.2    129.9      400      450      530      620      700      750      777 
## 
## lowest : 200 230 250 300 320, highest: 780 790 795 799 800
## ------------------------------------------------------------------------------------------------------------------------------------------------------
## gender_fac 
##        n  missing distinct 
##      700        0        2 
##                         
## Value      female   male
## Frequency     453    247
## Proportion  0.647  0.353
## ------------------------------------------------------------------------------------------------------------------------------------------------------
## education_fac 
##        n  missing distinct 
##      700        0        6 
##                                                                                         
## Value           college     graduate  high_school         none some_college      some_hs
## Frequency           138          141           44           57          275           45
## Proportion        0.197        0.201        0.063        0.081        0.393        0.064
## ------------------------------------------------------------------------------------------------------------------------------------------------------

The psych package also provides a partner function, describeBy(), which computes descriptive statistics by a grouping variable. To get descriptives by gender_fac, we simply pass that column name to the group argument.

describeBy(sat_act, group = c("gender_fac"))
## 
##  Descriptive statistics by group 
## group: female
##                vars   n   mean     sd median trimmed    mad min  max range  skew kurtosis   se
## gender            1 453   2.00   0.00      2     2.0   0.00   2    2     0   NaN      NaN 0.00
## education         2 453   3.26   1.35      3     3.4   1.48   0    5     5 -0.74     0.27 0.06
## age               3 453  25.45   9.37     22    23.7   5.93  13   65    52  1.77     3.03 0.44
## ACT               4 453  28.42   4.69     29    28.6   4.45  15   36    21 -0.39    -0.42 0.22
## SATV              5 453 610.66 112.31    620   617.9 103.78 200  800   600 -0.65     0.42 5.28
## SATQ              6 442 596.00 113.07    600   602.2 133.43 200  800   600 -0.58     0.13 5.38
## gender_fac*       7 453    NaN     NA     NA     NaN     NA Inf -Inf  -Inf    NA       NA   NA
## education_fac*    8 453    NaN     NA     NA     NaN     NA Inf -Inf  -Inf    NA       NA   NA
## ---------------------------------------------------------------------------------------------------------------- 
## group: male
##                vars   n  mean     sd median trimmed    mad min  max range  skew kurtosis   se
## gender            1 247   1.0   0.00      1    1.00   0.00   1    1     0   NaN      NaN 0.00
## education         2 247   3.0   1.54      3    3.12   1.48   0    5     5 -0.54    -0.60 0.10
## age               3 247  25.9   9.74     22   24.23   5.93  14   58    44  1.43     1.43 0.62
## ACT               4 247  28.8   5.06     30   29.23   4.45   3   36    33 -1.06     1.89 0.32
## SATV              5 247 615.1 114.16    630  622.07 118.61 200  800   600 -0.63     0.13 7.26
## SATQ              6 245 635.9 116.02    660  645.53  94.89 300  800   500 -0.72    -0.12 7.41
## gender_fac*       7 247   NaN     NA     NA     NaN     NA Inf -Inf  -Inf    NA       NA   NA
## education_fac*    8 247   NaN     NA     NA     NaN     NA Inf -Inf  -Inf    NA       NA   NA
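If the printed blocks are awkward to work with downstream, describeBy() can also return everything as one flat table via its mat argument; a quick sketch:

```r
library(psych)  # provides describeBy() and the sat.act data

# mat = TRUE returns a single matrix-style table with one row
# per variable-by-group combination, instead of printed blocks
desc_mat <- describeBy(sat.act, group = "gender", mat = TRUE)
head(desc_mat)
```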

We can add multiple grouping variables in the same manner.

describeBy(sat_act, group = c("gender_fac", "education_fac"))
## 
##  Descriptive statistics by group 
## gender_fac: female
## education_fac: college
##                vars  n  mean     sd median trimmed    mad min  max range  skew kurtosis    se
## gender            1 87   2.0   0.00      2     2.0   0.00   2    2     0   NaN      NaN  0.00
## education         2 87   4.0   0.00      4     4.0   0.00   4    4     0   NaN      NaN  0.00
## age               3 87  29.1   7.76     26    27.8   5.93  21   52    31  1.26     0.70  0.83
## ACT               4 87  29.4   4.32     30    29.6   4.45  19   36    17 -0.27    -0.67  0.46
## SATV              5 87 615.0 106.62    620   621.4  88.96 300  800   500 -0.58     0.28 11.43
## SATQ              6 86 597.6 106.24    600   605.8 118.61 300  800   500 -0.71     0.20 11.46
## gender_fac*       7 87   NaN     NA     NA     NaN     NA Inf -Inf  -Inf    NA       NA    NA
## education_fac*    8 87   NaN     NA     NA     NaN     NA Inf -Inf  -Inf    NA       NA    NA
## ---------------------------------------------------------------------------------------------------------------- 
## gender_fac: male
## education_fac: college
##                vars  n  mean     sd median trimmed   mad min  max range  skew kurtosis    se
## gender            1 51   1.0   0.00      1     1.0  0.00   1    1     0   NaN      NaN  0.00
## education         2 51   4.0   0.00      4     4.0  0.00   4    4     0   NaN      NaN  0.00
## age               3 51  32.2   9.03     29    30.8  8.90  23   57    34  1.20     0.63  1.27
## ACT               4 51  28.9   4.42     29    29.3  4.45  16   36    20 -0.74     0.12  0.62
## SATV              5 51 620.3  81.72    620   623.3 88.96 430  800   370 -0.26    -0.29 11.44
## SATQ              6 51 635.9 104.12    640   642.5 88.96 400  800   400 -0.46    -0.45 14.58
## gender_fac*       7 51   NaN     NA     NA     NaN    NA Inf -Inf  -Inf    NA       NA    NA
## education_fac*    8 51   NaN     NA     NA     NaN    NA Inf -Inf  -Inf    NA       NA    NA
## ---------------------------------------------------------------------------------------------------------------- 
## gender_fac: female
## education_fac: graduate
##                vars  n  mean     sd median trimmed    mad min  max range  skew kurtosis    se
## gender            1 95   2.0   0.00      2     2.0   0.00   2    2     0   NaN      NaN  0.00
## education         2 95   5.0   0.00      5     5.0   0.00   5    5     0   NaN      NaN  0.00
## age               3 95  34.3  10.67     30    32.7   8.90  22   65    43  1.18     0.61  1.09
## ACT               4 95  29.0   4.19     29    29.1   4.45  18   36    18 -0.31    -0.73  0.43
## SATV              5 95 620.4  95.72    620   623.6  74.13 300  800   500 -0.46     0.43  9.82
## SATQ              6 93 606.7 105.55    600   608.9 148.26 350  800   450 -0.14    -0.94 10.95
## gender_fac*       7 95   NaN     NA     NA     NaN     NA Inf -Inf  -Inf    NA       NA    NA
## education_fac*    8 95   NaN     NA     NA     NaN     NA Inf -Inf  -Inf    NA       NA    NA
## ---------------------------------------------------------------------------------------------------------------- 
## gender_fac: male
## education_fac: graduate
##                vars  n  mean    sd median trimmed    mad min  max range  skew kurtosis    se
## gender            1 46   1.0  0.00    1.0     1.0   0.00   1    1     0   NaN      NaN  0.00
## education         2 46   5.0  0.00    5.0     5.0   0.00   5    5     0   NaN      NaN  0.00
## age               3 46  35.9 10.00   35.5    35.1  11.12  22   58    36  0.47    -0.67  1.48
## ACT               4 46  30.8  3.11   32.0    30.9   2.97  25   36    11 -0.38    -0.81  0.46
## SATV              5 46 623.5 99.58  645.0   631.2  96.37 390  770   380 -0.61    -0.43 14.68
## SATQ              6 46 657.8 89.61  680.0   661.7 103.78 475  800   325 -0.45    -0.77 13.21
## gender_fac*       7 46   NaN    NA     NA     NaN     NA Inf -Inf  -Inf    NA       NA    NA
## education_fac*    8 46   NaN    NA     NA     NaN     NA Inf -Inf  -Inf    NA       NA    NA
## ---------------------------------------------------------------------------------------------------------------- 
## gender_fac: female
## education_fac: high_school
##                vars  n  mean     sd median trimmed    mad min  max range  skew kurtosis    se
## gender            1 21   2.0   0.00      2     2.0   0.00   2    2     0   NaN      NaN  0.00
## education         2 21   2.0   0.00      2     2.0   0.00   2    2     0   NaN      NaN  0.00
## age               3 21  30.1  12.22     26    28.4  10.38  18   57    39  1.16     0.01  2.67
## ACT               4 21  27.3   5.23     28    27.5   4.45  15   36    21 -0.32    -0.34  1.14
## SATV              5 21 593.6 115.34    600   598.2 118.61 375  770   395 -0.44    -0.91 25.17
## SATQ              6 20 586.5 120.96    585   587.8 163.09 375  800   425  0.01    -1.11 27.05
## gender_fac*       7 21   NaN     NA     NA     NaN     NA Inf -Inf  -Inf    NA       NA    NA
## education_fac*    8 21   NaN     NA     NA     NaN     NA Inf -Inf  -Inf    NA       NA    NA
## ---------------------------------------------------------------------------------------------------------------- 
## gender_fac: male
## education_fac: high_school
##                vars  n  mean     sd median trimmed    mad min  max range  skew kurtosis    se
## gender            1 23   1.0   0.00      1     1.0   0.00   1    1     0   NaN      NaN  0.00
## education         2 23   2.0   0.00      2     2.0   0.00   2    2     0   NaN      NaN  0.00
## age               3 23  25.3   8.68     22    23.6   4.45  18   55    37  1.94     3.63  1.81
## ACT               4 23  26.6   6.39     28    27.7   4.45   3   32    29 -2.14     5.39  1.33
## SATV              5 23 560.0 152.29    600   570.5 148.26 200  800   600 -0.53    -0.59 31.75
## SATQ              6 23 569.1 160.65    600   575.8 177.91 300  800   500 -0.36    -1.44 33.50
## gender_fac*       7 23   NaN     NA     NA     NaN     NA Inf -Inf  -Inf    NA       NA    NA
## education_fac*    8 23   NaN     NA     NA     NaN     NA Inf -Inf  -Inf    NA       NA    NA
## ---------------------------------------------------------------------------------------------------------------- 
## gender_fac: female
## education_fac: none
##                vars  n  mean     sd median trimmed    mad min  max range  skew kurtosis    se
## gender            1 30   2.0   0.00      2     2.0   0.00   2    2     0   NaN      NaN  0.00
## education         2 30   0.0   0.00      0     0.0   0.00   0    0     0   NaN      NaN  0.00
## age               3 30  17.0   1.07     17    17.1   0.74  13   18     5 -1.75     4.13  0.19
## ACT               4 30  26.1   5.06     26    25.9   5.93  15   36    21  0.08    -0.56  0.92
## SATV              5 30 595.3 123.46    595   597.1 148.26 350  800   450 -0.09    -0.81 22.54
## SATQ              6 29 599.7 123.20    600   601.0 148.26 333  800   467 -0.09    -0.99 22.88
## gender_fac*       7 30   NaN     NA     NA     NaN     NA Inf -Inf  -Inf    NA       NA    NA
## education_fac*    8 30   NaN     NA     NA     NaN     NA Inf -Inf  -Inf    NA       NA    NA
## ---------------------------------------------------------------------------------------------------------------- 
## gender_fac: male
## education_fac: none
##                vars  n  mean     sd median trimmed    mad min  max range  skew kurtosis    se
## gender            1 27   1.0   0.00      1     1.0   0.00   1    1     0   NaN      NaN  0.00
## education         2 27   0.0   0.00      0     0.0   0.00   0    0     0   NaN      NaN  0.00
## age               3 27  16.9   1.04     17    17.0   1.48  14   18     4 -0.86     0.34  0.20
## ACT               4 27  29.0   5.00     29    29.2   5.93  20   36    16 -0.30    -1.13  0.96
## SATV              5 27 640.1 132.24    670   646.2 177.91 400  800   400 -0.29    -1.40 25.45
## SATQ              6 27 642.7 127.90    660   647.9 177.91 400  800   400 -0.24    -1.36 24.61
## gender_fac*       7 27   NaN     NA     NA     NaN     NA Inf -Inf  -Inf    NA       NA    NA
## education_fac*    8 27   NaN     NA     NA     NaN     NA Inf -Inf  -Inf    NA       NA    NA
## ---------------------------------------------------------------------------------------------------------------- 
## gender_fac: female
## education_fac: some_college
##                vars   n  mean     sd median trimmed    mad min  max range  skew kurtosis   se
## gender            1 195   2.0   0.00      2     2.0   0.00   2    2     0   NaN      NaN 0.00
## education         2 195   3.0   0.00      3     3.0   0.00   3    3     0   NaN      NaN 0.00
## age               3 195  21.1   4.75     20    20.0   1.48  17   46    29  3.41    12.83 0.34
## ACT               4 195  28.2   4.78     29    28.4   4.45  16   36    20 -0.46    -0.47 0.34
## SATV              5 195 610.0 119.78    620   619.6 118.61 200  800   600 -0.81     0.66 8.58
## SATQ              6 190 590.9 114.46    600   598.9 118.61 200  800   600 -0.72     0.38 8.30
## gender_fac*       7 195   NaN     NA     NA     NaN     NA Inf -Inf  -Inf    NA       NA   NA
## education_fac*    8 195   NaN     NA     NA     NaN     NA Inf -Inf  -Inf    NA       NA   NA
## ---------------------------------------------------------------------------------------------------------------- 
## gender_fac: male
## education_fac: some_college
##                vars  n  mean     sd median trimmed    mad min  max range  skew kurtosis    se
## gender            1 80   1.0   0.00      1     1.0   0.00   1    1     0   NaN      NaN  0.00
## education         2 80   3.0   0.00      3     3.0   0.00   3    3     0   NaN      NaN  0.00
## age               3 80  20.8   3.06     20    20.3   1.48  17   34    17  2.00     4.55  0.34
## ACT               4 80  28.6   5.03     30    28.8   5.19  17   36    19 -0.45    -0.92  0.56
## SATV              5 80 617.4 111.79    630   624.5 111.19 300  800   500 -0.62    -0.06 12.50
## SATQ              6 79 642.6 118.28    680   653.1 118.61 300  800   500 -0.81    -0.17 13.31
## gender_fac*       7 80   NaN     NA     NA     NaN     NA Inf -Inf  -Inf    NA       NA    NA
## education_fac*    8 80   NaN     NA     NA     NaN     NA Inf -Inf  -Inf    NA       NA    NA
## ---------------------------------------------------------------------------------------------------------------- 
## gender_fac: female
## education_fac: some_hs
##                vars  n  mean     sd median trimmed    mad min  max range  skew kurtosis    se
## gender            1 25   2.0   0.00      2     2.0   0.00   2    2     0   NaN      NaN  0.00
## education         2 25   1.0   0.00      1     1.0   0.00   1    1     0   NaN      NaN  0.00
## age               3 25  19.3   4.62     18    18.1   0.00  17   37    20  2.86     7.27  0.92
## ACT               4 25  28.1   5.13     27    28.3   4.45  18   36    18 -0.21    -0.78  1.03
## SATV              5 25 597.0 119.38    610   600.8 133.43 350  799   449 -0.31    -0.95 23.88
## SATQ              6 24 592.5 140.83    625   606.6 111.19 230  799   569 -0.93     0.20 28.75
## gender_fac*       7 25   NaN     NA     NA     NaN     NA Inf -Inf  -Inf    NA       NA    NA
## education_fac*    8 25   NaN     NA     NA     NaN     NA Inf -Inf  -Inf    NA       NA    NA
## ---------------------------------------------------------------------------------------------------------------- 
## gender_fac: male
## education_fac: some_hs
##                vars  n  mean     sd median trimmed    mad min  max range  skew kurtosis    se
## gender            1 20   1.0   0.00      1     1.0   0.00   1    1     0   NaN      NaN  0.00
## education         2 20   1.0   0.00      1     1.0   0.00   1    1     0   NaN      NaN  0.00
## age               3 20  19.6   6.12     18    18.2   0.00  17   45    28  3.55    11.78  1.37
## ACT               4 20  26.7   7.11     28    27.1   8.15  15   35    20 -0.30    -1.51  1.59
## SATV              5 20 603.0 141.24    600   611.2 185.32 300  780   480 -0.39    -1.12 31.58
## SATQ              6 19 625.8  95.87    650   630.9  88.96 400  765   365 -0.66    -0.47 21.99
## gender_fac*       7 20   NaN     NA     NA     NaN     NA Inf -Inf  -Inf    NA       NA    NA
## education_fac*    8 20   NaN     NA     NA     NaN     NA Inf -Inf  -Inf    NA       NA    NA

Optional: While describe() is useful for quick glimpses of the data, it does not always play nicely with the tidyverse. If you want a more “pure” tidyverse approach to descriptive statistics, you can use skimr.

sat_act %>% 
  skimr::skim(.)
## Skim summary statistics
##  n obs: 700 
##  n variables: 8 
## 
## ── Variable type:character ─────────────────────────────────────────────────────────────────────────────────────────────────────
##       variable missing complete   n min max empty n_unique
##  education_fac       0      700 700   4  12     0        6
##     gender_fac       0      700 700   4   6     0        2
## 
## ── Variable type:integer ───────────────────────────────────────────────────────────────────────────────────────────────────────
##   variable missing complete   n   mean     sd  p0 p25 p50 p75 p100     hist
##        ACT       0      700 700  28.55   4.82   3  25  29  32   36 ▁▁▁▁▃▅▇▇
##        age       0      700 700  25.59   9.5   13  19  22  29   65 ▇▇▂▂▁▁▁▁
##  education       0      700 700   3.16   1.43   0   3   3   4    5 ▂▁▁▁▇▁▅▅
##     gender       0      700 700   1.65   0.48   1   1   2   2    2 ▅▁▁▁▁▁▁▇
##       SATQ      13      687 700 610.22 115.64 200 530 620 700  800 ▁▁▁▅▃▇▇▅
##       SATV       0      700 700 612.23 112.9  200 550 620 700  800 ▁▁▁▃▃▇▅▃

While skim() doesn’t provide as much descriptive detail as describe(), it covers the basics and adds a useful visual summary (the inline histograms) for each numeric variable.

Similarly, to obtain descriptives by grouping variables, pipe the data through group_by() before calling skim().

sat_act %>% 
  group_by(., gender_fac) %>% 
  skimr::skim(.)
## Skim summary statistics
##  n obs: 700 
##  n variables: 8 
##  group variables: gender_fac 
## 
## ── Variable type:character ─────────────────────────────────────────────────────────────────────────────────────────────────────
##  gender_fac      variable missing complete   n min max empty n_unique
##      female education_fac       0      453 453   4  12     0        6
##        male education_fac       0      247 247   4  12     0        6
## 
## ── Variable type:integer ───────────────────────────────────────────────────────────────────────────────────────────────────────
##  gender_fac  variable missing complete   n   mean     sd  p0 p25 p50   p75 p100     hist
##      female       ACT       0      453 453  28.42   4.69  15  25  29  32     36 ▁▁▂▅▇▅▇▆
##      female       age       0      453 453  25.45   9.37  13  19  22  28     65 ▆▇▃▂▁▁▁▁
##      female education       0      453 453   3.26   1.35   0   3   3   4      5 ▁▁▁▁▇▁▃▃
##      female    gender       0      453 453   2      0      2   2   2   2      2 ▁▁▁▇▁▁▁▁
##      female      SATQ      11      442 453 596    113.07 200 500 600 683    800 ▁▁▁▅▃▇▆▃
##      female      SATV       0      453 453 610.66 112.31 200 550 620 700    800 ▁▁▁▃▃▇▅▃
##        male       ACT       0      247 247  28.79   5.06   3  25  30  32.5   36 ▁▁▁▁▂▃▆▇
##        male       age       0      247 247  25.86   9.74  14  19  22  30     58 ▇▇▃▂▁▁▁▁
##        male education       0      247 247   3      1.54   0   2   3   4      5 ▃▂▁▂▇▁▅▅
##        male    gender       0      247 247   1      0      1   1   1   1      1 ▁▁▁▇▁▁▁▁
##        male      SATQ       2      245 247 635.87 116.02 300 570 660 720    800 ▁▂▂▃▅▅▇▆
##        male      SATV       0      247 247 615.11 114.16 200 540 630 700    800 ▁▁▂▅▃▇▇▅
sat_act %>% 
  group_by(., education_fac) %>% 
  skimr::skim(.)
## Skim summary statistics
##  n obs: 700 
##  n variables: 8 
##  group variables: education_fac 
## 
## ── Variable type:character ─────────────────────────────────────────────────────────────────────────────────────────────────────
##  education_fac   variable missing complete   n min max empty n_unique
##        college gender_fac       0      138 138   4   6     0        2
##       graduate gender_fac       0      141 141   4   6     0        2
##    high_school gender_fac       0       44  44   4   6     0        2
##           none gender_fac       0       57  57   4   6     0        2
##   some_college gender_fac       0      275 275   4   6     0        2
##        some_hs gender_fac       0       45  45   4   6     0        2
## 
## ── Variable type:integer ───────────────────────────────────────────────────────────────────────────────────────────────────────
##  education_fac  variable missing complete   n   mean     sd  p0   p25 p50    p75 p100     hist
##        college       ACT       0      138 138  29.26   4.35  16  27    30  32      36 ▁▂▂▃▆▇▆▆
##        college       age       0      138 138  30.24   8.36  21  24    28  34.75   57 ▇▅▂▂▁▁▁▁
##        college education       0      138 138   4      0      4   4     4   4       4 ▁▁▁▇▁▁▁▁
##        college    gender       0      138 138   1.63   0.48   1   1     2   2       2 ▅▁▁▁▁▁▁▇
##        college      SATQ       1      137 138 611.85 106.71 300 565   610 690     800 ▁▂▁▃▇▅▇▃
##        college      SATV       0      138 138 616.95  97.88 300 572.5 620 680     800 ▁▁▁▃▇▇▅▃
##       graduate       ACT       0      141 141  29.6    3.95  18  27    30  32      36 ▁▁▃▆▆▅▇▅
##       graduate       age       0      141 141  34.83  10.44  22  27    33  40      65 ▇▅▅▃▁▁▂▁
##       graduate education       0      141 141   5      0      5   5     5   5       5 ▁▁▁▇▁▁▁▁
##       graduate    gender       0      141 141   1.67   0.47   1   1     2   2       2 ▃▁▁▁▁▁▁▇
##       graduate      SATQ       2      139 141 623.63 103.09 350 535   640 700     800 ▁▂▆▅▆▆▇▆
##       graduate      SATV       0      141 141 621.4   96.65 300 570   630 700     800 ▁▁▂▅▇▇▆▃
##    high_school       ACT       0       44  44  26.98   5.81   3  25    28  31      36 ▁▁▁▁▃▇▇▅
##    high_school       age       0       44  44  27.57  10.68  18  20    25  28.75   57 ▇▆▁▂▁▁▁▂
##    high_school education       0       44  44   2      0      2   2     2   2       2 ▁▁▁▇▁▁▁▁
##    high_school    gender       0       44  44   1.48   0.51   1   1     1   2       2 ▇▁▁▁▁▁▁▇
##    high_school      SATQ       1       43  44 577.21 142.18 300 450   600 700     800 ▂▃▃▃▆▂▇▃
##    high_school      SATV       0       44  44 576.02 135.43 200 487.5 600 682.5   800 ▁▁▃▃▂▇▇▂
##           none       ACT       0       57  57  27.47   5.21  15  24    28  31      36 ▁▂▅▃▇▃▃▅
##           none       age       0       57  57  16.95   1.04  13  17    17  18      18 ▁▁▁▁▃▁▇▆
##           none education       0       57  57   0      0      0   0     0   0       0 ▁▁▁▇▁▁▁▁
##           none    gender       0       57  57   1.53   0.5    1   1     2   2       2 ▇▁▁▁▁▁▁▇
##           none      SATQ       1       56  57 620.43 126.21 333 515   620 712.5   800 ▁▁▇▅▅▃▅▇
##           none      SATV       0       57  57 616.51 128.53 350 500   600 710     800 ▂▂▆▃▅▃▆▇
##   some_college       ACT       0      275 275  28.29   4.85  16  25    29  32      36 ▁▃▃▅▅▇▆▆
##   some_college       age       0      275 275  21.01   4.32  17  19    20  21      46 ▇▃▁▁▁▁▁▁
##   some_college education       0      275 275   3      0      3   3     3   3       3 ▁▁▁▇▁▁▁▁
##   some_college    gender       0      275 275   1.71   0.46   1   1     2   2       2 ▃▁▁▁▁▁▁▇
##   some_college      SATQ       6      269 275 606.07 117.76 200 525   620 700     800 ▁▁▁▅▃▇▇▅
##   some_college      SATV       0      275 275 612.13 117.36 200 540   625 700     800 ▁▁▁▃▃▇▆▅
##        some_hs       ACT       0       45  45  27.49   6.06  15  24    27  33      36 ▂▅▁▅▇▂▅▇
##        some_hs       age       0       45  45  19.47   5.27  17  18    18  18      45 ▇▁▁▁▁▁▁▁
##        some_hs education       0       45  45   1      0      1   1     1   1       1 ▁▁▁▇▁▁▁▁
##        some_hs    gender       0       45  45   1.56   0.5    1   1     2   2       2 ▆▁▁▁▁▁▁▇
##        some_hs      SATQ       2       43  45 607.26 122.8  230 545   650 700     799 ▁▁▂▂▂▇▅▂
##        some_hs      SATV       0       45  45 599.67 128.05 300 500   600 700     799 ▂▂▃▅▇▃▇▇

Chi-squared

Suppose we wanted to see if the number of males and females differed by education level. One way to accomplish this is to perform a chi-squared test of independence. In R, the chisq.test() function expects a contingency table in order to produce the appropriate statistic. A contingency table provides count data by groups.

edu_gender_table <- table(sat_act$education_fac, sat_act$gender_fac)
print(edu_gender_table)
##               
##                female male
##   college          87   51
##   graduate         95   46
##   high_school      21   23
##   none             30   27
##   some_college    195   80
##   some_hs          25   20

Next, we simply pass the contingency table to the chisq.test() function.

chi_test <- chisq.test(edu_gender_table)
chi_test
## 
##  Pearson's Chi-squared test
## 
## data:  edu_gender_table
## X-squared = 20, df = 5, p-value = 0.007

It appears as though the two variables are not independent. We will examine the pairwise comparisons in a moment to get a better idea of which education levels are driving these results.

We can get the observed, expected, and residuals from the saved object.

chi_test$observed
##               
##                female male
##   college          87   51
##   graduate         95   46
##   high_school      21   23
##   none             30   27
##   some_college    195   80
##   some_hs          25   20
chi_test$expected
##               
##                female male
##   college        89.3 48.7
##   graduate       91.2 49.8
##   high_school    28.5 15.5
##   none           36.9 20.1
##   some_college  178.0 97.0
##   some_hs        29.1 15.9
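For intuition, each expected count is simply (row total × column total) / grand total. The quick sketch below reproduces chisq.test()'s expected counts by hand; the table is rebuilt from the raw psych::sat.act numeric codes so the snippet is self-contained:

```r
# Rebuild the education x gender table from the raw dataset (numeric codes)
data(sat.act, package = "psych")
tab <- table(sat.act$education, sat.act$gender)

# Expected count for each cell: row total * column total / grand total
expected <- outer(rowSums(tab), colSums(tab)) / sum(tab)
all.equal(unname(expected), unname(chisq.test(tab)$expected))  # TRUE
```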

We could also get the chi-squared statistic from the table's summary() method.

summary(edu_gender_table)
## Number of cases in table: 700 
## Number of factors: 2 
## Test for independence of all factors:
##  Chisq = 16, df = 5, p-value = 0.007

Visualizing Chi-Squared

Aside from simply printing a table, we can visualize frequencies with a mosaic plot.

vcd::mosaic(~ education_fac + gender_fac, data = sat_act, shade = TRUE, legend = TRUE)

Pairwise comparisons

If we want to compare each level with the others in our contingency table, we can compute those pairwise comparisons using rcompanion.

rcompanion::pairwiseNominalIndependence(edu_gender_table,
                                        fisher = FALSE,
                                        gtest = FALSE,
                                        correct = "holm")
##                    Comparison p.Chisq p.adj.Chisq
## 1          college : graduate 0.52600      0.6610
## 2       college : high_school 0.10400      0.2600
## 3              college : none 0.23400      0.3900
## 4      college : some_college 0.13200      0.2830
## 5           college : some_hs 0.47200      0.6610
## 6      graduate : high_school 0.02970      0.1480
## 7             graduate : none 0.07440      0.2230
## 8     graduate : some_college 0.52900      0.6610
## 9          graduate : some_hs 0.20600      0.3860
## 10         high_school : none 0.77300      0.8280
## 11 high_school : some_college 0.00398      0.0597
## 12      high_school : some_hs 0.59800      0.6900
## 13        none : some_college 0.01140      0.0855
## 14             none : some_hs 0.92500      0.9250
## 15     some_college : some_hs 0.05920      0.2220

Or with pairwise.t.test. Note that this method requires the original, numeric data.

pairwise <- pairwise.t.test(sat_act$gender, sat_act$education, p.adjust.method = "holm")
# broom::tidy(pairwise)
pairwise
## 
##  Pairwise comparisons using t tests with pooled SD 
## 
## data:  sat_act$gender and sat_act$education 
## 
##   0    1    2    3    4   
## 1 1.00 -    -    -    -   
## 2 1.00 1.00 -    -    -   
## 3 0.12 0.53 0.04 -    -   
## 4 1.00 1.00 0.63 1.00 -   
## 5 0.53 1.00 0.22 1.00 1.00
## 
## P value adjustment method: holm

Correlations

Moving beyond categorical variables, we next test the relationship between numeric values using simple correlation.

The Pearson product-moment correlation measures the linear association between two variables, \(x\) and \(y\), on a standardized scale ranging from \(r = -1\) to \(r = 1\).

The correlation of x and y is a covariance that has been standardized by the standard deviations of \(x\) and \(y\). This yields a scale-insensitive measure of the linear association of \(x\) and \(y\). For much more conceptual detail, see: https://psu-psychology.github.io/psy-597-SEM/01_correlation_regression/01_Correlation_and_Regression.html.

\[r_{XY}= \frac{s_{XY}}{s_{X} s_{Y}}\]
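To make the formula concrete, we can check that standardizing the covariance by the two standard deviations reproduces cor() exactly. A quick sketch using age and ACT, which are complete (no missing values) in this dataset:

```r
data(sat.act, package = "psych")  # raw dataset; age and ACT have no NAs

# cor(x, y) is cov(x, y) standardized by the two standard deviations
x <- sat.act$age
y <- sat.act$ACT
r_manual <- cov(x, y) / (sd(x) * sd(y))
all.equal(r_manual, cor(x, y))  # TRUE
```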

Correlation is done using the cor() function.

Suppose we want to see if the ACT scores increase with age.

# Covariance
# cov(sat_act$age, sat_act$ACT)

# Correlation
cor(sat_act$age, sat_act$ACT)
## [1] 0.111

A small correlation, but no test of significance. To obtain significance values, you must pass your data to cor.test(). Note that this can be done by passing x and y or using the formula method (which will be used for linear models).

# Default method
# cor.test(sat_act$age, sat_act$ACT)

# Formula method
cor.test(~ sat_act$age + sat_act$ACT, data = sat_act)
## 
##  Pearson's product-moment correlation
## 
## data:  sat_act$age and sat_act$ACT
## t = 3, df = 700, p-value = 0.003
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.0367 0.1831
## sample estimates:
##   cor 
## 0.111

Visualizing Correlations

To visualize this relationship, we can pass the raw data to ggplot and get a simple regression line using stat_smooth() with the lm method.

ggplot(sat_act, aes(x=age, y=ACT)) +
  geom_jitter(width=0.1) +
  stat_smooth(method="lm", se=FALSE)

Additional Correlation Functions

You can also pass a dataframe of values to the cor function to get a simple correlation matrix (a la SPSS).

cor(sat_act[c("age","ACT","SATV","SATQ")], use = "pairwise.complete.obs")
##          age   ACT    SATV    SATQ
## age   1.0000 0.111 -0.0424 -0.0339
## ACT   0.1105 1.000  0.5611  0.5871
## SATV -0.0424 0.561  1.0000  0.6443
## SATQ -0.0339 0.587  0.6443  1.0000

Or, for easier-on-the-eyes output, we can use a number of specialized functions.

Hmisc::rcorr(sat_act[c("age","ACT","SATV","SATQ")] %>% as.matrix(.))
##        age  ACT  SATV  SATQ
## age   1.00 0.11 -0.04 -0.03
## ACT   0.11 1.00  0.56  0.59
## SATV -0.04 0.56  1.00  0.64
## SATQ -0.03 0.59  0.64  1.00
## 
## n
##      age ACT SATV SATQ
## age  700 700  700  687
## ACT  700 700  700  687
## SATV 700 700  700  687
## SATQ 687 687  687  687
## 
## P
##      age    ACT    SATV   SATQ  
## age         0.0034 0.2631 0.3744
## ACT  0.0034        0.0000 0.0000
## SATV 0.2631 0.0000        0.0000
## SATQ 0.3744 0.0000 0.0000
apaTables::apa.cor.table(sat_act[c("age","ACT","SATV","SATQ")])
## 
## 
## Means, standard deviations, and correlations with confidence intervals
##  
## 
##   Variable M      SD     1           2          3         
##   1. age   25.59  9.50                                    
##                                                           
##   2. ACT   28.55  4.82   .11**                            
##                          [.04, .18]                       
##                                                           
##   3. SATV  612.23 112.90 -.04        .56**                
##                          [-.12, .03] [.51, .61]           
##                                                           
##   4. SATQ  610.22 115.64 -.03        .59**      .64**     
##                          [-.11, .04] [.54, .63] [.60, .69]
##                                                           
## 
## Note. M and SD are used to represent mean and standard deviation, respectively.
## Values in square brackets indicate the 95% confidence interval.
## The confidence interval is a plausible range of population correlations 
## that could have caused the sample correlation (Cumming, 2014).
## * indicates p < .05. ** indicates p < .01.
## 

Linear models

The overall goal is to give you an introduction to conducting regression analyses or linear modeling in R.

Single-predictor (simple) regression

Let’s turn to ‘simple’ linear regression (one predictor, one outcome), then scale to multiple regression (many predictors, one outcome). The standard linear regression model is implemented by the lm function in R. The lm function uses ordinary least squares (OLS), which estimates the parameters by minimizing the sum of squared residuals.

In simple regression, we are interested in a relationship of the form:

\[ Y = B_0 + B_1 X \]

where \(Y\) is the dependent variable (criterion) and \(X\) is the predictor (covariate). The intercept is represented by \(B_0\) and the slope for the \(X\) predictor by \(B_1\).

Let’s take a look at the simple case of verbal scores (SATV) as a function of math scores (SATQ).

ggplot(sat.act, aes(x=SATQ, y=SATV)) + 
  geom_point(color='darkblue', size = 3) + 
  geom_smooth(method=lm, se=FALSE, fullrange=TRUE, color='black', size=1.2) +
  labs(x="Math Score", y="Verbal Score")

When conducting regression, we typically try to capture linear relationships among variables. We can introduce higher-order polynomial terms (e.g., quadratic models) or splines (more flexible shapes) to capture nonlinearity, but this is beyond our scope here.
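For reference only, a quadratic term can be added with an explicit squared predictor, where I() protects the ^ operator inside a formula. This is a sketch, not something we pursue further:

```r
data(sat.act, package = "psych")

# Quadratic fit; rows with missing SATQ are dropped automatically by lm()
quad_fit <- lm(SATV ~ SATQ + I(SATQ^2), data = sat.act)
summary(quad_fit)
```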

Fortunately, this relationship looks quite linear! The higher the math score the higher the verbal score.

In R regression models, we use the ~ operator to denote ‘regressed on’. It’s not especially intuitive, but we say the criterion is regressed on the predictor. Here, if we think math scores are a key predictor of verbal scores, we’d say ‘verbal scores regressed on math scores’ or ‘math scores predict verbal scores.’

In formula terms, this is SATV ~ SATQ, which we pass as the first argument to lm().

lm_SAT <- lm(SATV ~ SATQ, data=sat.act)
summary(lm_SAT)
## 
## Call:
## lm(formula = SATV ~ SATQ, data = sat.act)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -365.9  -50.6   -3.2   54.7  257.7 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 227.1432    17.7798    12.8   <2e-16 ***
## SATQ          0.6312     0.0286    22.1   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 86.7 on 685 degrees of freedom
##   (13 observations deleted due to missingness)
## Multiple R-squared:  0.415,  Adjusted R-squared:  0.414 
## F-statistic:  486 on 1 and 685 DF,  p-value: <2e-16

The output contains the individual parameter estimates of the model (here, just the intercept and slope), their standard errors, t statistics, and p-values. We also get global information such as the residual standard error and the coefficient of determination (\(R^2\)).
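These quantities can also be pulled out of the fitted object directly, which is handy for reporting (the model is refit here so the snippet stands alone):

```r
data(sat.act, package = "psych")
lm_SAT <- lm(SATV ~ SATQ, data = sat.act)

coef(lm_SAT)               # intercept and slope estimates
confint(lm_SAT)            # 95% confidence intervals for both parameters
summary(lm_SAT)$r.squared  # coefficient of determination
```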

Regression diagnostics

We can also get useful diagnostic plots for free using the plot() function:

plot(lm_SAT)

The ggfortify package also provides an autoplot function that gives similar diagnostics within a handy ggplot-based graph.

autoplot(lm_SAT)

Bootstrap estimates and confidence intervals

Using functionality from the car and boot packages, we can easily get estimates of the regression coefficients and standard errors using nonparametric bootstrapping, which relaxes the normal-theory assumptions underlying the standard errors and, therefore, the significance tests; in particular, the bootstrap does not assume normally distributed errors.

Nonparametric bootstrapping approximates the sampling distribution for a statistic of interest (e.g., a slope) by resampling the existing data with replacement many times and examining the resulting density.
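To demystify the resampling step, here is a minimal hand-rolled bootstrap of the SATQ slope. car::Boot does this (and more) for you, so treat this purely as an illustration:

```r
data(sat.act, package = "psych")
sat_cc <- na.omit(sat.act[, c("SATV", "SATQ")])  # complete cases only

set.seed(1)  # for reproducibility of the resamples
boot_slopes <- replicate(2000, {
  idx <- sample(nrow(sat_cc), replace = TRUE)  # resample rows with replacement
  coef(lm(SATV ~ SATQ, data = sat_cc[idx, ]))[["SATQ"]]
})
sd(boot_slopes)                       # bootstrap SE of the slope
quantile(boot_slopes, c(.025, .975))  # simple percentile interval
```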

sat.act.na <- na.omit(sat.act)
lm_SAT.na <- lm(SATV ~ SATQ, data=sat.act.na)
system.time(lm_SAT.boot <- Boot(lm_SAT.na, R=2000))
##    user  system elapsed 
##   2.020   0.034   2.056
summary(lm_SAT.boot, high.moments=TRUE)
## 
## Number of bootstrap replications R = 2000 
##             original  bootBias  bootSE bootMed bootSkew bootKurtosis
## (Intercept)  227.143 -4.17e-02 20.8410 227.266   0.0513        0.198
## SATQ           0.631  7.07e-05  0.0325   0.631  -0.0430        0.191

We can use the object to obtain 95% bootstrapped confidence intervals using the ‘bias corrected, accelerated’ method (aka bca).

confint(lm_SAT.boot, level=.95, type="bca")
## Bootstrap bca confidence intervals
## 
##               2.5 %  97.5 %
## (Intercept) 187.015 269.539
## SATQ          0.565   0.696

And we can easily compare the bootstrapped and standard OLS models:

hist(lm_SAT.boot, legend="separate")

Multiple regression

We can easily extend to larger regression models by adding terms to the right side of the formula. For example, in the sat.act dataset, we could examine the extent to which math scores (SATQ) are a function of verbal scores (SATV), age (age), and sex (gender; 1 = male, 2 = female).

asv_model <- lm(SATQ ~ age + gender + SATV, sat.act)
summary(asv_model)
## 
## Call:
## lm(formula = SATQ ~ age + gender + SATV, data = sat.act)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -290.34  -50.32    5.81   51.51  295.57 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 273.5175    23.8614   11.46  < 2e-16 ***
## age          -0.1267     0.3498   -0.36     0.72    
## gender      -36.8580     6.9202   -5.33  1.4e-07 ***
## SATV          0.6541     0.0293   22.33  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 86.8 on 683 degrees of freedom
##   (13 observations deleted due to missingness)
## Multiple R-squared:  0.438,  Adjusted R-squared:  0.436 
## F-statistic:  178 on 3 and 683 DF,  p-value: <2e-16
sv_model <- lm(SATQ ~ gender + SATV, sat.act)
summary(sv_model)
## 
## Call:
## lm(formula = SATQ ~ gender + SATV, data = sat.act)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -291.27  -50.46    5.64   51.89  295.34 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 269.8997    21.6571   12.46  < 2e-16 ***
## gender      -36.8011     6.9140   -5.32  1.4e-07 ***
## SATV          0.6545     0.0293   22.37  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 86.8 on 684 degrees of freedom
##   (13 observations deleted due to missingness)
## Multiple R-squared:  0.438,  Adjusted R-squared:  0.437 
## F-statistic:  267 on 2 and 684 DF,  p-value: <2e-16

It appears that sex and verbal scores are both influential predictors, whereas age is not. We could examine the relationship graphically.

Visualizing regression data

ggplot(sat.act, aes(x=SATV, y=SATQ, color=factor(gender))) + 
  geom_point() + 
  scale_color_discrete(name="Sex", breaks = c("1", "2"), labels = c("Male", "Female")) +
  labs(title = "Multiple Regression", x = "SATV", y = "SATQ") +
  stat_smooth(method=lm, se=FALSE)

Getting results into a tidy, useful format

Note that the broom package is very useful for extracting global and specific statistics from many models in R, including regression models. The introductory vignette provides a number of useful examples: https://cran.r-project.org/web/packages/broom/vignettes/broom.html. Here, what if we want to save the global statistics and parameter estimates into data.frame objects?

We can use the glance function to get the global model statistics.

broom::glance(asv_model)
## # A tibble: 1 x 11
##   r.squared adj.r.squared sigma statistic  p.value    df logLik   AIC   BIC deviance df.residual
##       <dbl>         <dbl> <dbl>     <dbl>    <dbl> <int>  <dbl> <dbl> <dbl>    <dbl>       <int>
## 1     0.438         0.436  86.8      178. 3.52e-85     4 -4040. 8089. 8112. 5150990.         683

And the tidy function yields the parameter table

broom::tidy(asv_model)
## # A tibble: 4 x 5
##   term        estimate std.error statistic  p.value
##   <chr>          <dbl>     <dbl>     <dbl>    <dbl>
## 1 (Intercept)  274.      23.9       11.5   6.05e-28
## 2 age           -0.127    0.350     -0.362 7.17e- 1
## 3 gender       -36.9      6.92      -5.33  1.36e- 7
## 4 SATV           0.654    0.0293    22.3   2.51e-83

As you can imagine (and saw earlier in the functional programming overview), the ability to extract regression statistics into a tidy data.frame is a boon to scaling your analyses to multiple models and datasets.
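For instance, we could fit the same model separately within each education level and stack the tidied coefficient tables into a single data.frame. A sketch using split() and purrr::map_dfr() (this assumes the tidyverse is loaded, as at the top of the walkthrough):

```r
data(sat.act, package = "psych")

# One SATQ ~ SATV fit per education level; .id labels each group's rows
edu_fits <- sat.act %>%
  split(.$education) %>%
  purrr::map_dfr(~ broom::tidy(lm(SATQ ~ SATV, data = .x)), .id = "education")
edu_fits
```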

Modeling interactions

We can use the * operator in R to ask that both the constituent variables and their interaction(s) are entered into the model. For example:

int_model <- lm(SATQ ~ gender*age*SATV, sat.act)
summary(int_model)
## 
## Call:
## lm(formula = SATQ ~ gender * age * SATV, data = sat.act)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -293.41  -50.48    4.05   52.30  291.29 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     -2.40e+02   1.85e+02   -1.29    0.196    
## gender           2.17e+02   1.08e+02    2.01    0.045 *  
## age              1.84e+01   7.19e+00    2.56    0.011 *  
## SATV             1.37e+00   2.96e-01    4.62  4.6e-06 ***
## gender:age      -8.86e+00   4.15e+00   -2.14    0.033 *  
## gender:SATV     -3.38e-01   1.73e-01   -1.96    0.051 .  
## age:SATV        -2.55e-02   1.16e-02   -2.20    0.028 *  
## gender:age:SATV  1.15e-02   6.68e-03    1.73    0.084 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 86.2 on 679 degrees of freedom
##   (13 observations deleted due to missingness)
## Multiple R-squared:  0.45,   Adjusted R-squared:  0.444 
## F-statistic: 79.4 on 7 and 679 DF,  p-value: <2e-16

This model includes individual effects of sex (gender) and age (age), as well as their interaction (gender:age). This highlights that the asterisk operator * will compute all possible interactions among the specified predictors. For example, a*b*c*d will generate all effects up through and including the a x b x c x d interaction. By contrast, if you wish to specify a given interaction manually/directly, use the colon operator (e.g., a:b). The downside of the colon operator is that it doesn’t guarantee that the corresponding lower-order effects are included, and including them is usually the sane default. As a reminder, you should essentially never include an interaction without its lower-order effects, because this can misassign the variance.
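You can verify how these formula operators expand, without fitting anything, by inspecting the term labels of a formula (a quick sketch with placeholder variables a, b, and c):

```r
# * expands to all main effects plus all interactions
attr(terms(~ a * b), "term.labels")      # "a" "b" "a:b"
attr(terms(~ a * b * c), "term.labels")  # adds the two-way terms and "a:b:c"

# : includes only the product term -- lower-order effects are NOT added
attr(terms(~ a:b), "term.labels")        # "a:b"
```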

# Visualize the 2-way sex x age interaction on math scores
ggplot(data=sat.act, 
       aes(x=age, y=SATQ, colour = factor(gender))) +
  geom_jitter() +
  labs(x = "Age", y = "Math Score", colour = "Sex") +
  theme_bw() +
  theme(
    plot.background = element_blank()
    ,panel.grid.major = element_blank()
    ,panel.grid.minor = element_blank()
    ,panel.border = element_blank()
    ,axis.text = element_text(size=12)
    ,axis.title = element_text(size=14)
  ) +
  stat_smooth(method='lm', se=TRUE, fullrange=TRUE) +
  scale_color_manual(labels = c("Male", "Female"), values = c("#d21959", "#4aded3"))

What do we see? Males tend to have higher math scores and maintain them with age, whereas females have lower math scores and show a decreasing trend as they get older.

Contrasts in regression

(Some of the code and text here has been adapted from Russell Lenth’s excellent emmeans documentation: https://cran.r-project.org/web/packages/emmeans/)

One of the handiest packages in the R regression universe is emmeans, which can provide the ‘expected marginal means’ (em means), as well as a host of other contrasts and comparisons. In particular, it is very easy to test simple slopes and pairwise differences. Furthermore, the package works with multcomp to handle correction for multiple comparisons. See the longer documentation here.

Let’s bin age into a three-level factor (under 25, 25-50, over 50) and model math scores (SATQ) as a function of age group, adjusting for verbal scores (SATV).

sat.act$agef[sat.act$age < 25] <- 1
sat.act$agef[sat.act$age >= 25 & sat.act$age <= 50] <- 2
sat.act$agef[sat.act$age > 50] <- 3

sat.act2 <- sat.act
sat.act2$agef <- as.factor(sat.act2$agef)
sat.act2$gender <- as.factor(sat.act2$gender)
sat.lm <- lm(SATQ ~ agef + SATV, data = sat.act2)
summary(sat.lm)
## 
## Call:
## lm(formula = SATQ ~ agef + SATV, data = sat.act2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -301.62  -45.92    2.07   51.81  283.47 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 206.9031    18.8894   10.95   <2e-16 ***
## agef2        -0.0946     7.1350   -0.01     0.99    
## agef3        15.5437    18.9726    0.82     0.41    
## SATV          0.6579     0.0299   22.02   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 88.6 on 683 degrees of freedom
##   (13 observations deleted due to missingness)
## Multiple R-squared:  0.416,  Adjusted R-squared:  0.413 
## F-statistic:  162 on 3 and 683 DF,  p-value: <2e-16

This output is hard to look at because there are dummy codes and we have to infer the reference condition for the factor (usually the first level alphabetically or numerically). Also, we do not have an intuitive sense of the expected means in each condition because they depend on the sum of the intercept and the specific dummy code for the condition of interest, adjusting for the other predictors in the model.

We can obtain the expected means for each condition.

Expected means for each age group

sat.emm.s <- emmeans(sat.lm, "agef")
print(sat.emm.s)
##  agef emmean    SE  df lower.CL upper.CL
##  1       610  4.32 683      601      618
##  2       610  5.67 683      598      621
##  3       625 18.47 683      589      662
## 
## Confidence level used: 0.95

Expected mean at the average SATV

sat.emm.p <- emmeans(sat.lm, "SATV")
print(sat.emm.p)
##  SATV emmean   SE  df lower.CL upper.CL
##   612    615 4.32 683      606      623
## 
## Results are averaged over the levels of: agef 
## Confidence level used: 0.95

Means for each age group at the mean of SATV

print(emmeans(sat.lm, ~agef*SATV))
##  agef SATV emmean    SE  df lower.CL upper.CL
##  1     612    610  4.32 683      601      618
##  2     612    610  5.67 683      598      621
##  3     612    625 18.47 683      589      662
## 
## Confidence level used: 0.95

Pairwise comparisons among age groups

If we wanted to compare the pairwise differences in expected SATQ between age groups while controlling for SATV (and potentially other variables we add to the model), we could use the pairs function:

sat_pairs <- pairs(sat.emm.s)
print(sat_pairs)
##  contrast estimate    SE  df t.ratio p.value
##  1 - 2        0.09  7.13 683  0.013  1.0000 
##  1 - 3      -15.54 18.97 683 -0.819  0.6910 
##  2 - 3      -15.64 19.32 683 -0.809  0.6970 
## 
## P value adjustment: tukey method for comparing a family of 3 estimates

Note that you can get a sense of the contrasts being tested by emmeans by examining the @linfct slot of the object. I’ve learned a lot by examining these contrast matrices and thinking about how to set up a (focal) contrast of interest. Also note that you get p-value adjustment for free (here, Tukey’s HSD method).

Contrast matrix for the predicted mean SATQ in each age group, controlling for SATV.

sat.emm.s@linfct
##      (Intercept) agef2 agef3 SATV
## [1,]           1     0     0  612
## [2,]           1     1     0  612
## [3,]           1     0     1  612

What are the pairwise contrasts for the age groups?

sat_pairs@linfct
##      (Intercept) agef2 agef3 SATV
## [1,]           0    -1     0    0
## [2,]           0     0    -1    0
## [3,]           0     1    -1    0

The emmeans package also provides useful plots to understand pairwise differences:

plot(sat.emm.s, comparisons = TRUE)

The blue bars are confidence intervals for the EMMs, and the red arrows are for the comparisons among them. If an arrow from one mean overlaps an arrow from another group, the difference is not significant, based on the adjust setting (which defaults to “tukey”). (Note: Don’t ever use confidence intervals for EMMs to perform comparisons; they can be very misleading.)

Pairwise differences and simple slopes in regression

Consider a model in which we examine the association between SATQ and SATV across age. Here, we regress SATQ on SATV, age group (three levels), and their interaction.

fit_sat <- lm(SATQ ~ SATV * agef, data = sat.act2)
summary(fit_sat)
## 
## Call:
## lm(formula = SATQ ~ SATV * agef, data = sat.act2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -302.51  -50.21    2.09   54.61  256.86 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 178.8432    22.1853    8.06  3.4e-15 ***
## SATV          0.7034     0.0354   19.90  < 2e-16 ***
## agef2       109.8710    42.6641    2.58   0.0102 *  
## agef3        17.2861    95.7898    0.18   0.8568    
## SATV:agef2   -0.1804     0.0690   -2.61   0.0091 ** 
## SATV:agef3   -0.0022     0.1547   -0.01   0.9887    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 88.3 on 681 degrees of freedom
##   (13 observations deleted due to missingness)
## Multiple R-squared:  0.422,  Adjusted R-squared:  0.417 
## F-statistic: 99.3 on 5 and 681 DF,  p-value: <2e-16
car::Anova(fit_sat, type="III") #overall effects of predictors in the model
## Anova Table (Type III tests)
## 
## Response: SATQ
##              Sum Sq  Df F value  Pr(>F)    
## (Intercept)  506336   1   64.99 3.4e-15 ***
## SATV        3084321   1  395.85 < 2e-16 ***
## agef          51806   2    3.32   0.037 *  
## SATV:agef     53928   2    3.46   0.032 *  
## Residuals   5306050 681                    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Note that this yields a categorical (age group) x continuous (SATV) interaction. The output from car::Anova indicates that the interaction is significant, but we need more detailed guidance on how the slope for SATV is moderated by age group. We can visualize the interaction as follows:

ggplot(data=sat.act2, 
       aes(x=SATV, y=SATQ, colour = factor(agef))) +
  geom_jitter() +
  labs(x = "Verbal Score", y = "Math Score", colour = "Age") +
  theme_bw() +
  theme(
    plot.background = element_blank()
    ,panel.grid.major = element_blank()
    ,panel.grid.minor = element_blank()
    ,panel.border = element_blank()
    ,axis.text = element_text(size=12)
    ,axis.title = element_text(size=14)
  ) +
  stat_smooth(method='lm', se=FALSE, fullrange=TRUE) +
  scale_color_manual(labels = c("Under 25", "25-50", "Over 50"), values = c("red", "green", "blue"))

In a simple slopes test, we might wish to know whether the slope for SATV is non-zero in each age group individually. Let’s start by getting the estimated marginal means for each age group.

emmeans(fit_sat, ~agef)
## NOTE: Results may be misleading due to involvement in interactions
##  agef emmean    SE  df lower.CL upper.CL
##  1       610  4.31 681      601      618
##  2       609  5.66 681      598      620
##  3       626 18.43 681      589      662
## 
## Confidence level used: 0.95

And pairwise differences between age groups:

pairs(emmeans(fit_sat, ~agef))
## NOTE: Results may be misleading due to involvement in interactions
##  contrast estimate    SE  df t.ratio p.value
##  1 - 2        0.62  7.11 681  0.087  0.9960 
##  1 - 3      -15.94 18.92 681 -0.842  0.6770 
##  2 - 3      -16.56 19.28 681 -0.859  0.6660 
## 
## P value adjustment: tukey method for comparing a family of 3 estimates

Turning to SATV, because we are interested in its linear effect (slope), we use the emtrends function to estimate the slope within each age group individually. In terms of simple slopes, we test whether the SATV slope is non-zero in each age group. The infer argument in the summary of emtrends requests t-tests and p-values for the slopes.

summary(emtrends(model = fit_sat, ~agef, var="SATV"), infer=TRUE)
##  agef SATV.trend     SE  df lower.CL upper.CL t.ratio p.value
##  1         0.703 0.0354 681    0.634    0.773 19.900  <.0001 
##  2         0.523 0.0593 681    0.407    0.639  8.820  <.0001 
##  3         0.701 0.1506 681    0.406    0.997  4.660  <.0001 
## 
## Confidence level used: 0.95

Finally, we could examine pairwise differences between slopes among age groups.

pairs(emtrends(model = fit_sat, ~agef, var="SATV"))
##  contrast estimate    SE  df t.ratio p.value
##  1 - 2      0.1804 0.069 681  2.614  0.0250 
##  1 - 3      0.0022 0.155 681  0.014  1.0000 
##  2 - 3     -0.1782 0.162 681 -1.101  0.5140 
## 
## P value adjustment: tukey method for comparing a family of 3 estimates

A few other emmeans features

In the sat.act dataset, we have treated age as a factor. If we keep this representation (as opposed to entering age as continuous), we can easily get orthogonal polynomial contrasts in emmeans. For example, is the effect of age group on SATQ linear, or might it be quadratic? (With three levels, only linear and quadratic trends are estimable.)

sat.emm.p <- emmeans(sat.lm, "agef")
ply <- contrast(sat.emm.p, "poly")
ply
##  contrast  estimate   SE  df t.ratio p.value
##  linear        15.5 19.0 683 0.819   0.4130 
##  quadratic     15.7 22.1 683 0.712   0.4770
coef(ply) #show the contrast coefficients
##   agef c.1 c.2
## 1    1  -1   1
## 2    2   0  -2
## 3    3   1   1

There is a lot more on probing interactions here: https://cran.r-project.org/web/packages/emmeans/vignettes/interactions.html.

Finally, we can examine effects in multivariate regression models (i.e., multiple DVs). Here, we examine both the verbal and math scores and whether they are associated with age, gender (sex), education, and ACT scores.

org.int <- lm(cbind(SATV, SATQ) ~ age + gender + education + ACT, data = sat.act2)
summary(org.int)
## Response SATV :
## 
## Call:
## lm(formula = SATV ~ age + gender + education + ACT, data = sat.act2)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -502.0  -47.9    7.6   60.1  239.9 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 261.2947    23.3033   11.21   <2e-16 ***
## age          -1.3893     0.4537   -3.06   0.0023 ** 
## gender2      -0.0713     7.5002   -0.01   0.9924    
## education     1.4926     3.0573    0.49   0.6256    
## ACT          13.3790     0.7487   17.87   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 93.3 on 682 degrees of freedom
##   (13 observations deleted due to missingness)
## Multiple R-squared:  0.326,  Adjusted R-squared:  0.322 
## F-statistic: 82.3 on 4 and 682 DF,  p-value: <2e-16
## 
## 
## Response SATQ :
## 
## Call:
## lm(formula = SATQ ~ age + gender + education + ACT, data = sat.act2)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -492.5  -48.0    8.2   63.3  257.1 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  259.647     22.891   11.34  < 2e-16 ***
## age           -1.352      0.446   -3.03   0.0025 ** 
## gender2      -34.805      7.367   -4.72  2.8e-06 ***
## education      1.102      3.003    0.37   0.7138    
## ACT           14.155      0.735   19.25  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 91.7 on 682 degrees of freedom
##   (13 observations deleted due to missingness)
## Multiple R-squared:  0.375,  Adjusted R-squared:  0.372 
## F-statistic:  102 on 4 and 682 DF,  p-value: <2e-16
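Base R can also give multivariate (MANOVA-style) tests for each predictor across both DVs at once by calling anova() on the multivariate lm fit. Because sat.act2 comes from the psych package, this sketch uses the built-in mtcars data; the variables chosen are purely illustrative.

```r
# Multivariate tests (Pillai's trace by default) for each term, jointly across
# two responses; mtcars and the chosen variables are illustrative only
fit_mv <- lm(cbind(mpg, disp) ~ wt + hp, data = mtcars)
anova(fit_mv)
```

This complements the per-response summaries above, which test each DV separately.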

ANOVA

The overall goal is to give you a very quick introduction to conducting one-way ANOVA and two-way ANOVA in R.

Analysis of variance (ANOVA) provides a statistical test of whether two or more population means are equal. It is based on the law of total variance, where the observed variance in a particular variable is partitioned into components attributable to different sources of variation. When the population means are equal, the differences among the group means reflect the operation of experimental error alone (within-groups deviation). When the population means are not equal, the differences among the group means reflect the operation of both experimental error (within-groups deviation) and treatment effects (between-groups deviation). In ANOVA, we produce the ratio of:

\[\frac{S^2_{treatment} + S^2_{error}}{S^2_{error}}\]

Variance = (sum of squared deviations from the mean)/(degrees of freedom) = SS/df

Sum of squares: \[SS_{Total} = SS_{between} + SS_{within}\] \[\sum(Y_{ij}-\bar{Y}_T)^2 = \sum(\bar{Y}_{a_i}-\bar{Y}_T)^2 + \sum(Y_{ij}-\bar{Y}_{a_i})^2\]

Degrees of freedom: \[df_{Total} = df_{between} + df_{within}\]
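The sum-of-squares decomposition can be verified numerically. This quick check uses the built-in ToothGrowth data (len by supp) and is an addition to the original walkthrough:

```r
# Verify SS_total = SS_between + SS_within on ToothGrowth (len by supp)
y <- ToothGrowth$len
g <- ToothGrowth$supp
grand_mean <- mean(y)
group_mean <- ave(y, g)   # each observation's group mean
ss_total   <- sum((y - grand_mean)^2)
ss_between <- sum((group_mean - grand_mean)^2)
ss_within  <- sum((y - group_mean)^2)
isTRUE(all.equal(ss_total, ss_between + ss_within))  # TRUE
```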


For a single factor

The one-way ANOVA is an extension of the independent two-sample t-test for comparing means when there are two or more groups. In one-way ANOVA, the data are organized into several groups based on a single grouping variable (also called a factor).

One-way ANOVA assumptions:

- Independence of observations
- Normality (normal distribution of the residuals)
- Homoscedasticity (equal variances for all groups)
  1. Normality: when normality is in doubt, the Kruskal-Wallis rank sum test is a non-parametric alternative to the one-way ANOVA
df<-ToothGrowth
df$dose <- factor(df$dose, 
                  levels = c(0.5, 1, 2),
                  labels = c("D0.5", "D1", "D2"))
kruskal.test(len ~ supp, data = df)
## 
##  Kruskal-Wallis rank sum test
## 
## data:  len by supp
## Kruskal-Wallis chi-squared = 3, df = 1, p-value = 0.06

From the output above we can see that the p-value (0.06) is not less than the significance level of 0.05, so we fail to reject the null hypothesis of no difference between the supplement groups.
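The conventional direct check of the normality assumption (not shown in the original) is a Shapiro-Wilk test on the residuals of the one-way fit:

```r
# Shapiro-Wilk normality test on the residuals of len ~ supp; a sketch added
# here, not part of the original walkthrough
res_norm <- residuals(aov(len ~ supp, data = ToothGrowth))
shapiro.test(res_norm)
```

A non-significant result is consistent with normally distributed residuals.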

  1. Homogeneity of variances
leveneTest(len ~ supp, data = df)
## Levene's Test for Homogeneity of Variance (center = median)
##       Df F value Pr(>F)
## group  1    1.21   0.28
##       58

From the output above we can see that the p-value is not less than the significance level of 0.05. This means there is no evidence that the variance differs significantly across groups. Therefore, we can assume homogeneity of variance in the different treatment groups.
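Base R also offers bartlett.test() for checking equal variances. It is more powerful than Levene's test under normality but more sensitive to departures from it; this comparison is an addition to the original walkthrough:

```r
# Bartlett's test of homogeneity of variances for the same comparison
bartlett.test(len ~ supp, data = ToothGrowth)
```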

table(df$supp) # there are two levels
## 
## OJ VC 
## 30 30
group_by(df, supp) %>%
  summarise(
    count = n(),
    mean = mean(len, na.rm = TRUE),
    sd = sd(len, na.rm = TRUE)
  )
## # A tibble: 2 x 4
##   supp  count  mean    sd
##   <fct> <int> <dbl> <dbl>
## 1 OJ       30  20.7  6.61
## 2 VC       30  17.0  8.27
# Compute the analysis of variance
res.aov <- aov(len ~ supp, data = df)
# Summary of the analysis
summary(res.aov)
##             Df Sum Sq Mean Sq F value Pr(>F)  
## supp         1    205     205    3.67   0.06 .
## Residuals   58   3247      56                 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

As the p-value is greater than the significance level of 0.05, we conclude that there are no significant differences between the groups.
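Even with a non-significant F, an effect size is informative. Eta-squared (SS_between / SS_total) can be computed by hand from the aov table; this calculation is an addition to the original walkthrough:

```r
# eta^2 = SS_effect / SS_total, extracted from the aov summary table
aov_fit <- aov(len ~ supp, data = ToothGrowth)
ss <- summary(aov_fit)[[1]][["Sum Sq"]]
eta_sq <- ss[1] / sum(ss)
eta_sq  # about 0.06: supp explains roughly 6% of the variance in len
```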

Box plots

Plot tooth length by supplement group, highlighting outliers and jittering the individual points

ggplot(df, aes(x=supp, y=len)) + 
  geom_boxplot(outlier.colour="red", outlier.shape=8,outlier.size=4) +
  geom_jitter(shape=16, position=position_jitter(0.2))

Check the homogeneity of variance assumption

plot(res.aov, 1)

In the plot above, there is no evident relationship between residuals and fitted values (the mean of each group), which is good. Points 23 and 26 are flagged as outliers, which can severely affect normality and homogeneity of variance. It can be useful to remove outliers to meet the test assumptions.

Check the normality assumption

plot(res.aov, 2)

The Q-Q plot checks the assumption that the residuals are normally distributed; the points should approximately follow the straight reference line.

Multiple pairwise-comparison between the means of groups

In a one-way ANOVA, a significant p-value indicates that some of the group means differ, but not which pairs. Multiple pairwise comparisons can determine whether the mean difference between specific pairs of groups is statistically significant. Example: Tukey HSD (Tukey Honest Significant Differences; R function TukeyHSD()).

TukeyHSD(res.aov)
##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = len ~ supp, data = df)
## 
## $supp
##       diff   lwr   upr p adj
## VC-OJ -3.7 -7.57 0.167  0.06

- diff: difference between the means of the two groups
- lwr, upr: the lower and upper end points of the confidence interval at 95% (default)
- p adj: p-value after adjustment for the multiple comparisons


For two factors

The two-way ANOVA is used to evaluate simultaneously the effect of two grouping variables (A and B) on a response variable. The number of levels may differ between the two factors.

Two-way ANOVA null hypotheses:

  1. There is no difference in the means of factor A
  2. There is no difference in the means of factor B
  3. There is no interaction between factors A and B

The alternative hypothesis for cases 1 and 2 is that the means are not equal; the alternative for case 3 is that there is an interaction between A and B. The test also carries the usual assumptions: normality (normal distribution of the residuals) and homoscedasticity (equal variances for all groups).
table(df$supp, df$dose)
##     
##      D0.5 D1 D2
##   OJ   10 10 10
##   VC   10 10 10
group_by(df, supp, dose) %>%
  summarise(
    count = n(),
    mean = mean(len, na.rm = TRUE),
    sd = sd(len, na.rm = TRUE)
  )
## # A tibble: 6 x 5
## # Groups:   supp [2]
##   supp  dose  count  mean    sd
##   <fct> <fct> <int> <dbl> <dbl>
## 1 OJ    D0.5     10 13.2   4.46
## 2 OJ    D1       10 22.7   3.91
## 3 OJ    D2       10 26.1   2.66
## 4 VC    D0.5     10  7.98  2.75
## 5 VC    D1       10 16.8   2.52
## 6 VC    D2       10 26.1   4.80

In two-way ANOVA, we assume the sample sizes within cells are equal (a balanced design). If not, the ANOVA should be handled differently (e.g., with Type-III sums of squares).
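For unbalanced designs, marginal (Type-III) tests are available via car::Anova(mod, type = 3) (car is loaded at the top of this walkthrough) or, in base R, via drop1(). Either way, sum-to-zero contrasts are needed for interpretable main effects when the interaction is in the model. A base-R sketch (ToothGrowth happens to be balanced, so here the results match the sequential table):

```r
# Marginal (Type-III-style) F tests via drop1(); contr.sum coding makes the
# main-effect tests sensible in the presence of the interaction
tg <- ToothGrowth
tg$dose <- factor(tg$dose)
mod <- lm(len ~ supp * dose, data = tg,
          contrasts = list(supp = contr.sum, dose = contr.sum))
drop1(mod, scope = ~ ., test = "F")
```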

We want to know if tooth length depends on supp and dose.

res.aov2 <- aov(len ~ supp + dose, data = df)
summary(res.aov2)
##             Df Sum Sq Mean Sq F value  Pr(>F)    
## supp         1    205     205    14.0 0.00043 ***
## dose         2   2426    1213    82.8 < 2e-16 ***
## Residuals   56    820      15                    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

From the ANOVA table we can conclude that both supp and dose are statistically significant (highlighted with ***); dose is the more significant of the two. These results suggest that changing the delivery method (supp) or the dose of vitamin C will significantly affect mean tooth length.

Note that the model fitted above is called an additive model. It assumes that the effects of the two factors are independent. If you think the two variables might interact to create a synergistic effect, replace the plus symbol (+) with an asterisk (*), as follows.

res.aov3 <- aov(len ~ supp * dose, data = df)
summary(res.aov3)
##             Df Sum Sq Mean Sq F value  Pr(>F)    
## supp         1    205     205   15.57 0.00023 ***
## dose         2   2426    1213   92.00 < 2e-16 ***
## supp:dose    2    108      54    4.11 0.02186 *  
## Residuals   54    712      13                    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

It can be seen that the two main effects (supp and dose) are statistically significant, as well as their interaction.
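Given the significant interaction, it is worth probing simple effects, i.e., the effect of supp at each level of dose. A quick base-R sketch using per-dose Welch t-tests (the df object built earlier would work equivalently):

```r
# Simple effect of supp within each dose level via Welch t-tests; the delivery
# method matters at the lower doses but not at the highest dose
tg <- ToothGrowth
tg$dose <- factor(tg$dose)
sapply(split(tg, tg$dose), function(d) t.test(len ~ supp, data = d)$p.value)
```

Note these p-values are unadjusted; with several simple effects, apply a correction such as p.adjust().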

Line plots with multiple groups

Plot tooth length (“len”) by dose group (“dose”), coloring the lines by a second group: “supp”.

data_summary <- function(data, varname, groupnames){
  require(plyr)
  # compute the mean and sd of a single column
  summary_func <- function(x, col){
    c(mean = mean(x[[col]], na.rm = TRUE),
      sd = sd(x[[col]], na.rm = TRUE))
  }
  # apply per group, then rename the "mean" column to the variable name
  data_sum <- ddply(data, groupnames, .fun = summary_func, varname)
  data_sum <- rename(data_sum, c("mean" = varname))
  return(data_sum)
}
df3 <- data_summary(df, varname="len", 
                    groupnames=c("supp", "dose"))
## Loading required package: plyr
## ----------------------------------------------------------------------------------------------------------------------------------------------------
## You have loaded plyr after dplyr - this is likely to cause problems.
## If you need functions from both plyr and dplyr, please load plyr first, then dplyr:
## library(plyr); library(dplyr)
## ----------------------------------------------------------------------------------------------------------------------------------------------------
## 
## Attaching package: 'plyr'
## The following objects are masked from 'package:Hmisc':
## 
##     is.discrete, summarize
## The following objects are masked from 'package:dplyr':
## 
##     arrange, count, desc, failwith, id, mutate, rename, summarise, summarize
## The following object is masked from 'package:purrr':
## 
##     compact
ggplot(df3, aes(x=dose, y=len, group=supp, color=supp)) + 
    geom_errorbar(aes(ymin=len-sd, ymax=len+sd), width=.1, 
    position=position_dodge(0.05)) +
    geom_line() + geom_point()+
   scale_color_brewer(palette="Paired")+theme_minimal()

Multiple pairwise-comparison between the means of groups

This determines whether the mean difference between specific pairs of groups is statistically significant.

TukeyHSD(res.aov3, which = "dose")
##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = len ~ supp * dose, data = df)
## 
## $dose
##          diff   lwr   upr p adj
## D1-D0.5  9.13  6.36 11.90     0
## D2-D0.5 15.50 12.73 18.26     0
## D2-D1    6.37  3.60  9.13     0

Alternatives:

- t-test: comparing means between two groups in a sample (t.test function)
- Linear regression: comparing one or more factors in a sample (lm or lmer function)
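With only two groups, these approaches agree: a pooled-variance t-test and a one-way ANOVA give the same inference, with F = t². A sketch on the same data:

```r
# Equivalence of the pooled t-test and one-way ANOVA for two groups: F = t^2
tt <- t.test(len ~ supp, data = ToothGrowth, var.equal = TRUE)
fv <- summary(aov(len ~ supp, data = ToothGrowth))[[1]][["F value"]][1]
all.equal(unname(tt$statistic)^2, fv)  # TRUE
```

(The Kruskal-Wallis test used earlier plays the analogous role to the Wilcoxon rank-sum test in the two-group case.)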

Hands on practice: data analyses

For this hands on practice, load the spi data set from the psych package.

  1. Check the help file, structure, and first few observations of the data

  1. Change the integer categorical variables to factors by cross-referencing the help page and filling in the pipeline below (note: when knitting this document, change eval to TRUE once you complete this exercise)

spi <- spi %>%
  mutate(.,
    sex_fac = case_when(),
    health_fac = case_when(),
    education_fac = case_when(),
    smoke_fac = case_when(),
    exer_fac = case_when()
  )

Descriptives and Chi-squared

  1. Use either describe() or skim() to describe the data by each categorical variable

  2. Describe the data by both categorical variables at once

  1. Perform a chi-squared test assessing whether there is a difference in smoking by education level

Correlation

  1. Get the covariance and correlation matrix for the relationship between three numeric values of your choosing

  2. Perform a test of significance on two of the variables selected above

Linear Models

ANOVAs